A Machine Learning Approach to Predicting House Prices Using Advanced Regression Techniques


Introduction

Kaggle has launched a competition aimed not so much at a monetary prize as at helping to spread machine learning and opening many areas of discussion among researchers. In this competition, we will attempt to arrive at an optimal machine learning model to predict home prices in Ames, Iowa, using various advanced regression techniques.

(Here we analyze data for the period from 2006 to 2010.)

The Dataset Structure

The data is divided into two groups:

  • The training dataset contains 79 explanatory variables that describe (nearly) every aspect of a residential home, in addition to the sale price and ID variables, for a total of 81 variables.
  • The test dataset contains the same 79 explanatory variables, in addition to the ID, for a total of 80 variables.

The goal is to predict an appropriate selling price for each house in the test data.
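As a quick sketch of this structure (using tiny made-up stand-ins for the real Kaggle CSVs, which would normally be loaded with `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")`), the only column present in the training set but absent from the test set is the target, SalePrice:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the real Kaggle files.
train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test = pd.DataFrame({"Id": [3, 4], "LotArea": [11250, 9550]})

# The training set carries exactly one extra column: the target, SalePrice.
extra = set(train.columns) - set(test.columns)
```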

Problem description

The main goal here is to predict the final price of each house for each identifier in the test dataset by predicting the value of the selling price variable. This is not a simple problem, owing to the many features that can affect the prediction, in addition to several problems in the data itself, the most important of which are missing values, outliers, and skewed distributions for some features.

Import Libraries

Define functions

Upload Data

Preprocessing Training Data

Data Wrangling

Every data project requires a unique approach to ensure that its final dataset is reliable, accessible, and easy to analyze.

General Properties

Observations from the above dataset:

    Looking at the displayed information for the training data, we notice that many variables have missing values, which require processing or deletion.

Split Features

Numerical and Categorical features
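One common way to perform this split is pandas' `select_dtypes`; the miniature frame below is a made-up stand-in for the real training data:

```python
import pandas as pd

# Illustrative stand-in with one numeric, one categorical, and the target column.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})

# Numeric columns (ints/floats) vs. object (string/categorical) columns.
num_features = df.select_dtypes(include="number").columns.tolist()
cat_features = df.select_dtypes(include="object").columns.tolist()
```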

Data Cleaning

Information That We Need To Delete Or Modify

  1. Exploring the missing values
  2. Handling missing values
  3. Check duplicate rows from the dataset
  4. Remove unused columns that are not needed in the analysis process.

1- Exploring the missing values

  • Using a heatmap, we can quickly identify the missing data in each variable and flag variables that need to be processed or deleted.
  • The number of missing values in each variable in the training data.
  • We can quantify the missing data in each variable more explicitly using a bar chart.
  • We can also report the percentage of missing data for each variable.
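The counts and percentages described above can each be computed in one line; the small frame below is an illustrative stand-in for the real dataset:

```python
import numpy as np
import pandas as pd

# Made-up frame: PoolQC is mostly missing, LotFrontage has one gap.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Missing-value count and percentage per variable, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
# missing.plot.bar() or sns.heatmap(df.isnull()) would visualize the same information.
```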

Observations from the above dataset:

    The variables PoolQC, MiscFeature, Alley, and Fence each contain more than 80% missing data, which means that keeping them would not benefit the analysis. On the contrary, if they are imputed and kept, they may lead to misleading results, so in the next step we will delete them.

2- Handling missing values

  • Drop variables that contain more than 80% missing values.
  • Fill the remaining missing values with the mean or mode, according to the variable type.
  • The number of missing values in each variable in the training data after deleting the columns PoolQC, MiscFeature, Alley, and Fence, because they contain more than 80% missing data.
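A minimal sketch of these two steps, on a made-up frame in which PoolQC is 90% missing (numeric columns get the mean, categorical columns get the mode):

```python
import numpy as np
import pandas as pd

# Illustrative frame: PoolQC is 90% missing and should be dropped.
df = pd.DataFrame({
    "PoolQC": [np.nan] * 9 + ["Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan, 70.0, 65.0, 68.0, 60.0, 64.0],
    "MasVnrType": ["BrkFace", None, "None", "BrkFace", "Stone",
                   "BrkFace", None, "BrkFace", "None", "BrkFace"],
})

# 1) Drop columns with more than 80% missing values.
df = df.loc[:, df.isnull().mean() <= 0.80]

# 2) Fill the rest: mean for numeric columns, mode for categorical ones.
for col in df.columns:
    if df[col].dtype.kind in "if":          # integer or float column
        df[col] = df[col].fillna(df[col].mean())
    else:                                   # object / categorical column
        df[col] = df[col].fillna(df[col].mode()[0])
```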

3. Check duplicate rows from the dataset
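Duplicate rows can be counted and removed with pandas' built-ins; the tiny frame below is illustrative:

```python
import pandas as pd

# Made-up frame with one exact duplicate row.
df = pd.DataFrame({"Id": [1, 2, 2], "LotArea": [8450, 9600, 9600]})

n_dup = df.duplicated().sum()   # count fully duplicated rows
df = df.drop_duplicates()       # keep the first occurrence of each
```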

4. Remove unused columns that are not needed in the analysis process.

Exploratory Data Analysis

Through approximately 79 variables that are supposed to have an impact on the selling price of the house, we will focus here on examining the relationship of these variables with the selling price variable by asking and answering many questions using mathematical and statistical operations, and creating visualizations aimed at addressing the research questions we raised in the introduction section.

Exploring the relationship between numerical features and SalePrice

Here we are interested in knowing the shape of the data distribution and whether there is a correlation between the sale price and the numerical variables.

Correlation
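A sketch of the correlation check on synthetic data: a strongly related area variable and an unrelated year variable stand in for the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, 200)
# Price driven by area plus noise; YrSold is independent of price.
price = 50 * area + rng.normal(0, 5000, 200)
df = pd.DataFrame({
    "GrLivArea": area,
    "YrSold": rng.integers(2006, 2011, 200),
    "SalePrice": price,
})

# Pearson correlation of every numeric variable with SalePrice.
corr = df.corr()["SalePrice"].sort_values(ascending=False)
```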

Observations from the above dataset:

    With a first look at the form of the relationship and the distribution and spread of data between the numerical variables and the selling price variable, we find that it is divided into:

    • Defective variables:

      In these variables, we notice a clear correlation with the selling price variable, but we also notice skew in the data distribution, with most of the data concentrated on one side, which weakens the relationship.

      Reason:

      The existence of outliers that could have an impact on this relationship

      Required treatment:

      Process outliers by deleting them or by processing them with an interquartile range (IQR).

      Variable names:

      (LotArea - MasVnrArea - BsmtFinSF1 - TotalBsmtSF - OpenPorchSF)

  • Strong relationship:

    For these variables, we find that the relationship with the selling price variable is strong, and the shape of the data distribution is good and close to normal, but there are also some outliers that may affect this correlation.

    Required treatment:

    Process outliers by deleting them or by processing them with an interquartile range (IQR).

    Variable names:

    (GrLivArea - GarageArea - GarageYrBlt - YearBuilt - YearRemodAdd - WoodDeckSF - BsmtUnfSF)

  • Discrete variables with a strong relationship and influence:

    We note here that there is a strong and influential relationship between each of these variables and the selling price variable; what is new here is that although these variables are numeric, they behave like categorical variables.

    Reason:

    The values of these variables are discrete values.

    Here we have an important note:

    Each unit increase in the discrete variable corresponds to the same increase in the variable we want to measure (sale price). As long as this holds, the regression coefficient for that independent variable is meaningful, because it can legitimately be interpreted as the slope of the regression line.

    Variable names:

    (OverallQual - FullBath - TotRmsAbvGrd - BsmtFullBath - HalfBath - BedroomAbvGr - KitchenAbvGr - Fireplaces - GarageCars)

  • Weak relationship, stable within a certain limit:

    For this set of variables, the relationship with the selling price variable remains roughly constant regardless of changes in these variables.

    Required treatment: We wait for the degree of correlation to be measured.

    Variable names:

    (YrSold - MSSubClass - BsmtHalfBath - MoSold - LowQualFinSF - MiscVal - PoolArea - ScreenPorch - 3SsnPorch - EnclosedPorch)

Categorical Features

The relationship between the selling price and categorical features

Handling outliers

Removing outliers is an important step in data analysis. However, when removing outliers in machine learning we should be careful, because the test set may also contain outliers.
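The interquartile-range (IQR) rule mentioned earlier can be sketched as follows; the prices are made-up and include one obvious outlier:

```python
import pandas as pd

# Illustrative price series with one extreme value (755000).
s = pd.Series([34900, 120000, 140000, 160000, 180000, 755000])

# Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = s[(s >= lower) & (s <= upper)]
```

An alternative to deletion is clipping values to the fence with `s.clip(lower, upper)`, which keeps the row count unchanged.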

SalePrice

Normality test for the SalePrice data

Skewness and kurtosis of SalePrice
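A sketch of the skewness check, using log-normal draws as a stand-in for the right-skewed SalePrice; a log1p transform brings the skewness close to zero, which is why the target is often log-transformed before modeling:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(42)
# Log-normal draws mimic SalePrice's right-skewed shape.
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)

raw_skew = skew(prices)            # clearly positive (right-skewed)
raw_kurt = kurtosis(prices)        # excess kurtosis of the raw prices
log_skew = skew(np.log1p(prices))  # near zero after the log transform
```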

Multicollinearity Features

What is a good VIF value? The higher the value, the greater the correlation of the variable with other variables. Values of more than 4 or 5 are sometimes regarded as being moderate to high, with values of 10 or more being regarded as very high.
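VIF can be computed directly from its definition, VIF_i = 1 / (1 - R²_i), where R²_i comes from regressing feature i on the remaining features (statsmodels also provides `variance_inflation_factor`). The sketch below uses synthetic data with two nearly collinear columns:

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R^2_i), regressing column i on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1 -> huge VIF
x3 = rng.normal(size=500)                  # independent -> VIF near 1
vifs = vif(np.column_stack([x1, x2, x3]))
```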

Preprocessing Testing Data

Data Wrangling

Every data project requires a unique approach to ensure that its final dataset is reliable, accessible, and easy to analyze.

General Properties

Observations from the above dataset:

    Looking at the displayed information for the test data, we notice that many variables have missing values, which require processing or deletion.

Numerical and Categorical features

Data Cleaning

Information That We Need To Delete Or Modify

  1. Exploring the missing values
  2. Handling missing values
  3. Check duplicate rows from the dataset
  4. Remove unused columns that are not needed in the analysis process.

1- Exploring the missing values

  • Using a heatmap, we can quickly identify the missing data in each variable and flag variables that need to be processed or deleted.
  • The number of missing values in each variable in the test data.
  • We can quantify the missing data in each variable more explicitly using a bar chart.
  • We can also report the percentage of missing data for each variable.

Observations from the above dataset:

    The variables PoolQC, MiscFeature, Alley, and Fence each contain more than 80% missing data, which means that keeping them would not benefit the analysis. On the contrary, if they are imputed and kept, they may lead to misleading results, so in the next step we will delete them.

2- Handling missing values

  • Drop variables that contain more than 80% missing values.
  • Fill the remaining missing values with the mean or mode, according to the variable type.
  • The number of missing values in each variable in the test data after deleting the columns PoolQC, MiscFeature, Alley, and Fence, because they contain more than 80% missing data.

3. Check duplicate rows from the dataset

4. Remove unused columns that are not needed in the analysis process.

Concatenate Data

Dummy Variable
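A minimal sketch of dummy encoding with `pd.get_dummies` (the frame is a made-up stand-in; `drop_first=True` drops one level per variable to avoid the dummy-variable trap):

```python
import pandas as pd

# Illustrative frame with one categorical column to encode.
df = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "LotArea": [8450, 9600, 11250],
})

# One-hot encode Neighborhood; the first level (CollgCr) becomes the baseline.
dummies = pd.get_dummies(df, columns=["Neighborhood"], drop_first=True)
```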

Removing Id column

Split Data (Train / Test)
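Since the two sets were concatenated for consistent preprocessing, the split back into train and test is simply by row position; a sketch on miniature stand-in frames:

```python
import pandas as pd

# Made-up stand-ins for the preprocessed Kaggle frames.
train = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test = pd.DataFrame({"LotArea": [11250, 9550]})

n_train = len(train)
# Concatenate features only; shared preprocessing/dummies would happen here.
combined = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)

# Split back by row count: first n_train rows are the training features.
X_train = combined.iloc[:n_train]
X_test = combined.iloc[n_train:]
y_train = train["SalePrice"]
```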

Machine Learning

Regression Models and Comparison of Results
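The comparison carried out in this section can be sketched as a loop that fits each model and scores it with RMSE and R² on a held-out split; the data and alpha values below are illustrative, not the notebook's actual values:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the house-price features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for name, model in [("Linear", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.01))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "RMSE": mean_squared_error(y_te, pred) ** 0.5,
        "R2": r2_score(y_te, pred),
    }
```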

Linear Regression

Model Evaluation

Ridge Regression

Model Evaluation

Lasso Regression

Model Evaluation

Elastic Net Regression

Model Evaluation

Stochastic Gradient Descent (SGD Regressor)

A linear model fitted by minimizing a regularized empirical loss with SGD. SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (aka learning rate). The regularizer is a penalty added to the loss function that shrinks model parameters toward the zero vector using either the squared Euclidean norm (L2), the absolute norm (L1), or a combination of both (Elastic Net).
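Because SGD is sensitive to feature scale, it is usually combined with standardization; a sketch on synthetic data, with illustrative hyperparameters:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic linear problem standing in for the house-price features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=400)

# Standardize before SGD; elasticnet penalty mixes L1 and L2 shrinkage.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="elasticnet", alpha=1e-4, max_iter=2000, random_state=0),
)
model.fit(X, y)
r2 = model.score(X, y)
```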

Model Evaluation

Decision Tree Regressor

Model Evaluation

Random Forest Regressor

Model Evaluation

K Nearest Neighbors - Regression (KNN)

Model Evaluation

Gaussian process regression

Model Evaluation

LightGBM for Quantile Regression (LightGBM)

Model Evaluation

XGBoost for Regression (XGB)

Model Evaluation

Comparison plot: RMSE of all Regression Techniques without SGD

Comparison plot: R2 of all Regression Techniques without SGD

Correlation of model results

Mean of the best models
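Averaging the predictions of the best models can be done elementwise; the prediction arrays below are made-up placeholders for the actual model outputs:

```python
import numpy as np

# Hypothetical test-set predictions from the best-performing models.
pred_xgb = np.array([210000.0, 180000.0])
pred_lgbm = np.array([205000.0, 175000.0])
pred_ridge = np.array([200000.0, 185000.0])

# Simple averaging ensemble: mean prediction per house.
final = np.mean([pred_xgb, pred_lgbm, pred_ridge], axis=0)
```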